A Compiler Toolchain for Distributed Data Intensive Scientific Workflows

Author

  • Peter Bui
Abstract

With the growing amount of computational resources available to researchers today and the explosion of scientific data in modern research, it is imperative that scientists be able to construct data processing applications that harness these vast computing systems. To address this need, I propose applying concepts from traditional compilers, linkers, and profilers to the construction of distributed workflows and evaluate this approach by implementing a compiler toolchain that allows users to compose scientific workflows in a high-level programming language. In this dissertation, I describe the execution and programming model of this compiler toolchain. Next, I examine four compiler optimizations and evaluate their effectiveness at improving the performance of various distributed workflows. Afterwards, I present a set of linking utilities for packaging workflows and a group of profiling tools for analyzing and debugging workflows. Finally, I discuss modifications made to the run-time system to support features such as enhanced provenance information and garbage collection. Altogether, these components form a compiler toolchain that demonstrates the effectiveness of applying traditional compiler techniques to the challenges of constructing distributed data intensive scientific workflows.

Similar resources

Simulation of Terabit Data Flows for Exascale Applications

Scientific workflows are increasingly drawing attention as both data and compute resources are getting bigger, heterogeneous, and distributed. Many science workflows are both compute and data intensive and use distributed resources. This situation poses significant challenges in terms of real-time remote analysis and dissemination of massive datasets to scientists across the community. These ch...

Editorial: Scientific Workflows, Provenance and Their Applications

Scientific workflows play a crucial role in modern eScience [5] where many significant scientific discoveries are achieved through complex and distributed computations. For many scientists in the Life Sciences, in bioinformatics, geosciences, chemistry, physics, and numerous other domains, scientific workflows have become an enabling technology to formalize and automate complex and data intensi...

Trustworthy and Dynamic Mobile Task Scheduling in Data-Intensive Scientific Workflow Environments

There is an increasing demand for data-intensive applications in which scientists use scientific workflows to integrate together data management, analysis, simulation and visualization services over often voluminous complex and distributed scientific data and services. One major limitation of current scientific workflow models is that each workflow task is stationary, requiring a dataset to be ...

Performance Database: Capturing Data for Optimising Distributed Streaming Workflows

It is evident that data-intensive research is transforming the computing landscape, as recognised in “The Fourth Paradigm” [1]. Due to the scale, complexity and heterogeneity of data gathered in scientific experiments, we cannot naively dump the data into computing resources and hope to extract useful information and knowledge through exhaustive and unstructured computations. To survive t...

Parallelizing XML data-streaming workflows via MapReduce

In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the Map-Reduce framework. Pipelines in our approach consist...

Publication date: 2012